In this part, I apply Association Rule Mining (ARM) and network analysis to the text data (Twitter response tweets). The dataset (tweetEN_clean.csv) used for ARM and networking can be found here. The code implementing these methods is shown along the way, and it can also be found here.
A network is basically a set of objects connected to each other, similar to a spider web. The connections, or associations, between the objects are displayed as links or edges. Edges can be either directed or undirected, and either weighted or unweighted. By looking at the directions and weights of the edges, we can dive deeper into the connections and relationships between the objects.
Typically, for N objects there are up to N(N-1)/2 possible undirected connections and N(N-1) possible directed connections, i.e., on the order of N^2.
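To make that count concrete, here is a minimal sketch (with four hypothetical objects) of how many undirected and directed connections are possible:

```python
from itertools import combinations, permutations

objects = ["a", "b", "c", "d"]  # four hypothetical objects
n = len(objects)

undirected = list(combinations(objects, 2))  # unordered pairs (undirected links)
directed = list(permutations(objects, 2))    # ordered pairs (directed links)

print(len(undirected))  # n*(n-1)/2 = 6
print(len(directed))    # n*(n-1)   = 12
```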
In this tab, a Python package named NetworkX will be used to create, manipulate, and analyze the structure, dynamics, and functions of the networks of the Twitter response text.
Association Rule Mining (ARM) is an unsupervised and rule-based machine learning method for finding inter-relations between variables in large databases based on their statistical relevance. It is worth noting that association measures the co-occurrence of the objects instead of causality.
Given a training dataset, the goal of ARM is to discover rules that will forecast the existence of an object based on the appearance of other objects in the training data. In other words, ARM is designed to uncover the connections in the dataset and how strong those connections are.
Some real-life applications of ARM are market basket analysis (e.g., "customers who buy X also tend to buy Y"), recommendation systems, and medical diagnosis.
The dataset is a clean comma-separated values (csv) file with 775 rows and 11 variables describing the Twitter response (tweet) data:

- author_id: tweet author ID
- id: tweet ID
- created_at: tweet creation date and time
- text: original tweet content
- clean_text: clean tweet content after punctuation, special characters, etc. are removed
- tweet_tokenized: tokenized clean tweet content
- tweet_nonstop: tokenized clean tweet with stop words removed
- tweet_stemmed: stemmed clean tweet content
- tweet_lemmatized: lemmatized clean tweet content
- sentiment: sentiment value of the tweet
- label: label generated based on the tweet sentiment value
However, the only variable that will be used for the Association Rule Mining (ARM) and network analysis is tweet_lemmatized, which is the lemmatized tweet content after the text cleaning process. In other words, the lemmatized tweet tokens will be used as the "transaction data" here in order to explore the associations of the text data.
### Import Relevant Packages
```python
import json
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from apyori import apriori
import networkx as nx
```
Below is a snapshot of the clean tweet dataset:
```python
tweetDF = pd.read_csv("tweetEN_clean.csv")
tweetDF.head()
```
| | author_id | id | created_at | text | clean_text | tweet_tokenized | tweet_nonstop | tweet_stemmed | tweet_lemmatized | sentiment | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1116548763168858112 | 1575191634026717184 | 2022-09-28T18:32:05.000Z | RT @VSkirbekk: Gradual convergence in fertilit... | gradual convergence fertility between china we... | ['gradual', 'convergence', 'fertility', 'betwe... | ['gradual', 'convergence', 'fertility', 'china... | ['gradual', 'converg', 'fertil', 'china', 'wes... | ['gradual', 'convergence', 'fertility', 'china... | 0.0000 | neutral |
| 1 | 1364997075851599873 | 1575190110920114178 | 2022-09-28T18:26:02.000Z | RT @nytimes: South Korea has had the world's l... | south korea world lowest total fertility rate ... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | 0.2415 | positive |
| 2 | 1231317688288468994 | 1575189759550693377 | 2022-09-28T18:24:38.000Z | RT @nytimes: South Korea has had the world's l... | south korea world lowest total fertility rate ... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | 0.2415 | positive |
| 3 | 780710885186674688 | 1575188100078133248 | 2022-09-28T18:18:03.000Z | RT @nytimes: South Korea has had the world's l... | south korea world lowest total fertility rate ... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | ['south', 'korea', 'world', 'lowest', 'total',... | 0.2415 | positive |
| 4 | 178464094 | 1575186284808531969 | 2022-09-28T18:10:50.000Z | RT @koryodynasty: Always trust men to come up ... | always trust come with policies women this cas... | ['always', 'trust', 'come', 'with', 'policies'... | ['always', 'trust', 'come', 'policies', 'women... | ['alway', 'trust', 'come', 'polici', 'women', ... | ['always', 'trust', 'come', 'policy', 'woman',... | 0.7184 | positive |
Below is some basic information about the clean tweet dataset:
```python
tweetDF.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 775 entries, 0 to 774
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   author_id         775 non-null    int64
 1   id                775 non-null    int64
 2   created_at        775 non-null    object
 3   text              775 non-null    object
 4   clean_text        769 non-null    object
 5   tweet_tokenized   775 non-null    object
 6   tweet_nonstop     775 non-null    object
 7   tweet_stemmed     775 non-null    object
 8   tweet_lemmatized  775 non-null    object
 9   sentiment         775 non-null    float64
 10  label             775 non-null    object
dtypes: float64(1), int64(2), object(8)
memory usage: 66.7+ KB
```
The chunk of code below extracts each tweet_lemmatized element, parses each row's string representation back into a list, and collects those lists into one outer list, so the result looks like this:
[[lemmatized_tweet_1], [lemmatized_tweet_2], [lemmatized_tweet_3], ...]
```python
import ast

tweets = list(tweetDF["tweet_lemmatized"])
out = []
for twt in tweets:
    # each cell is a string like "['south', 'korea', ...]"; parse it into a list
    out.append(ast.literal_eval(twt))
tweets = out
```
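As a quick illustration of why ast.literal_eval is needed here: the CSV stores each token list as its string representation, and literal_eval safely parses it back into a real Python list (the cell value below is a toy example, not from the actual file):

```python
import ast

# A CSV cell holds the *string* repr of a token list, not a list object
cell = "['south', 'korea', 'world']"
tokens = ast.literal_eval(cell)  # safely evaluates literal structures only

print(type(tokens).__name__)  # list
print(tokens)                 # ['south', 'korea', 'world']
```

Unlike eval, literal_eval only accepts Python literals, so malformed or malicious cell contents raise an error instead of executing code.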
The chunk of code below reformats the apriori output into a pandas dataframe with columns "rhs", "lhs", "supp", "conf", "supp x conf", "lift":

- supp (support): the support of A and B, Supp(A, B), measures how often the items in A and B occur together, relative to all transactions; it scales how common an item-set is (1 = very common, 0 = irrelevant)
- conf (confidence): the confidence of A and B, Conf(A, B), measures how often items in A and items in B occur together, relative to the transactions that contain A; it scales how statistically "strong" a rule is (1 = strong rule, B occurs every time A does; 0 = no instance of the rule occurring)
- supp x conf (support x confidence): large support means a frequently occurring rule, and large confidence means a strong rule, so a large product suggests the rule is both frequent and strong
- lift: the lift of a rule is the ratio of its observed support to the support expected if A and B were independent
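These measures can be computed by hand on a small set of hypothetical transactions (the token names below are illustrative, not taken from the real data):

```python
# Toy transactions to illustrate support, confidence, and lift
transactions = [
    {"woman", "policy"},
    {"woman", "policy", "baby"},
    {"policy", "baby"},
    {"woman"},
]
n = len(transactions)

def support(*items):
    """Fraction of transactions containing all the given items."""
    s = set(items)
    return sum(s <= t for t in transactions) / n

supp_ab = support("woman", "policy")                        # P(A and B) = 0.5
conf_ab = supp_ab / support("woman")                        # P(B | A)   = 2/3
lift_ab = supp_ab / (support("woman") * support("policy"))  # ratio vs independence

print(supp_ab, conf_ab, lift_ab)
```

A lift below 1 (as here) means the two items co-occur slightly less often than expected if they were independent; lifts well above 1 indicate a genuinely associated pair.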
```python
# Reformat the apriori output into a pandas dataframe with columns
# "rhs", "lhs", "supp", "conf", "supp x conf", "lift"
def reformat_results(results):
    keep = []
    for i in range(0, len(results)):
        # results[i][1] is the support of the whole item-set;
        # capture it before unpacking the ordered statistics
        supp = results[i][1]
        for j in range(0, len(list(results[i]))):
            if (j > 1):
                for k in range(0, len(list(results[i][j]))):
                    if (len(results[i][j][k][0]) != 0):
                        rhs = list(results[i][j][k][0])
                        lhs = list(results[i][j][k][1])
                        conf = float(results[i][j][k][2])
                        lift = float(results[i][j][k][3])
                        keep.append([rhs, lhs, supp, conf, supp * conf, lift])
    return pd.DataFrame(keep, columns=["rhs", "lhs", "supp", "conf", "supp x conf", "lift"])
```
The chunk of code below converts the dataframe of the apriori output to a NetworkX object:

```python
# Utility function: convert the rules dataframe to a NetworkX object
def convert_to_network(df):
    print(df)
    # Build a directed graph
    G = nx.DiGraph()
    for row in df.iterrows():
        # Join multi-item sides with "_" so each side becomes a single node label
        rhs = "_".join(row[1][0])
        lhs = "_".join(row[1][1])
        conf = row[1][3]
        if rhs not in G.nodes:
            G.add_node(rhs)
        if lhs not in G.nodes:
            G.add_node(lhs)
        edge = (rhs, lhs)
        if edge not in G.edges:
            # Edge weight is the rule's confidence
            G.add_edge(rhs, lhs, weight=conf)
    return G
```
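As a sanity check of this conversion idea, the sketch below builds the same kind of directed, confidence-weighted graph from a hypothetical two-rule dataframe (all values illustrative):

```python
import networkx as nx
import pandas as pd

# Hypothetical two-rule frame in the same column layout reformat_results produces
df = pd.DataFrame(
    [[["woman"], ["policy"], 0.5, 0.67, 0.335, 0.89],
     [["policy"], ["woman"], 0.5, 0.67, 0.335, 0.89]],
    columns=["rhs", "lhs", "supp", "conf", "supp x conf", "lift"],
)

G = nx.DiGraph()
for _, row in df.iterrows():
    # Join multi-item sides with "_" so each side becomes one node label
    G.add_edge("_".join(row["rhs"]), "_".join(row["lhs"]), weight=row["conf"])

print(sorted(G.nodes))      # ['policy', 'woman']
print(G.number_of_edges())  # 2 (one directed edge per rule)
```

Because the graph is directed, the rule A → B and its reverse B → A become two distinct edges, each carrying its own confidence as the weight.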
The code chunk below is a function that plots the NetworkX object as a network graph (nodes and edges):
```python
# Utility function: plot the NetworkX object
def plot_network(G):
    # Specify x-y positions for plotting
    pos = nx.random_layout(G)
    # Generate plot
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 15)
    # Edge weights (rule confidence) control line width
    weights_e = [G[u][v]['weight'] for u, v in G.edges()]
    # Sample a colormap for edge colors
    cmap = plt.cm.get_cmap('Blues')
    colors_e = [cmap(G[u][v]['weight'] * 10) for u, v in G.edges()]
    # Plot
    nx.draw(
        G,
        edgecolors="black",
        edge_color=colors_e,
        node_size=2000,
        linewidths=2,
        font_size=8,
        font_color="white",
        font_weight="bold",
        width=weights_e,
        with_labels=True,
        pos=pos,
        ax=ax
    )
    ax.set(title='Twitter Response (Fertility Hashtag)')
    plt.show()
```
```python
print("Transactions:\n", pd.DataFrame(out[:5]))
```

```
Transactions:
0 1 2 3 4 5 6 \
0 gradual convergence fertility china western country None
1 south korea world lowest total fertility rate
2 south korea world lowest total fertility rate
3 south korea world lowest total fertility rate
4 always trust come policy woman case ingenious
7 8 9 10 11 12 13
0 None None None None None None None
1 year mayor seoul said nanny would encourage
2 year mayor seoul said nanny would encourage
3 year mayor seoul said nanny would encourage
4 plan baby boost None None None None
```
Train the ARM model using the apriori package, and fit the model on the text data:
```python
# Train the ARM model using the "apriori" package
# Earlier parameter settings that were tried:
# results = list(apriori(out, min_support=0.003, min_confidence=0.02, min_length=1, max_length=5))
# results = list(apriori(transactions, min_support=0.005, min_confidence=0.05, min_length=1, max_length=5))
results = list(apriori(out, min_support=0.05, min_confidence=0.3, min_lift=4, min_length=4, max_length=5))
print(len(results))
```

```
4255
```
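To illustrate what the min_support threshold does, here is a pure-Python sketch of the support-counting step of Apriori on toy transactions (not the real tweet data, and not the apyori implementation):

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy transactions
transactions = [
    ["south", "korea", "fertility"],
    ["south", "korea", "rate"],
    ["fertility", "rate"],
    ["south", "korea", "fertility", "rate"],
]
n = len(transactions)
min_support = 0.6  # keep only pairs appearing in at least 60% of transactions

# Count how many transactions contain each unordered item pair
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(set(t)), 2):
        pair_counts[pair] += 1

frequent = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent)  # only ('korea', 'south') survives, with support 0.75
```

Lowering min_support admits many more (rarer) item-sets, which is why the thresholds above strongly control how many of the 4255 rules come back.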
The chunks of code below each reformat a slice of the Apriori results, convert it to a NetworkX object, and plot the resulting network:
```python
# Plot the results as a NetworkX object
pd_results = reformat_results(results[:50])
G = convert_to_network(pd_results)
plot_network(G)
```

```
            rhs          lhs      supp      conf  supp x conf       lift
0      [always]       [baby]  0.068387  0.981481     0.067121  12.469642
1        [baby]     [always]  0.068387  0.868852     0.059418  12.469642
2      [always]      [boost]  0.068387  0.981481     0.067121  14.086077
3       [boost]     [always]  0.068387  0.981481     0.067121  14.086077
4      [always]       [case]  0.068387  0.981481     0.067121  13.114623
..          ...          ...       ...       ...          ...        ...
95     [record]     [factor]  0.161290  0.892857     0.144009   5.405971
96     [factor]  [shattered]  0.161290  0.976562     0.157510   5.866945
97  [shattered]     [factor]  0.161290  0.968992     0.156289   5.866945
98  [ingenious]       [plan]  0.068387  1.000000     0.068387  14.351852
99       [plan]  [ingenious]  0.068387  0.981481     0.067121  14.351852

[100 rows x 6 columns]
```
```python
pd_results = reformat_results(results[51:100])
G = convert_to_network(pd_results)
plot_network(G)
```

```
rhs lhs supp conf supp x conf \
0 [ingenious] [trust] 0.068387 1.000000 0.068387
1 [trust] [ingenious] 0.068387 1.000000 0.068387
2 [ingenious] [woman] 0.068387 1.000000 0.068387
3 [woman] [ingenious] 0.068387 0.456897 0.031246
4 [mayor] [nanny] 0.101935 0.766990 0.078184
.. ... ... ... ... ...
169 [case] [always, policy] 0.068387 0.913793 0.062492
170 [policy] [always, case] 0.068387 0.828125 0.056633
171 [always, case] [policy] 0.068387 1.000000 0.068387
172 [always, policy] [case] 0.068387 1.000000 0.068387
173 [policy, case] [always] 0.068387 1.000000 0.068387
lift
0 14.622642
1 14.622642
2 6.681034
3 6.681034
4 7.248994
.. ...
169 13.362069
170 12.109375
171 12.109375
172 13.362069
173 14.351852

[174 rows x 6 columns]
```
```python
pd_results = reformat_results(results[101:150])
G = convert_to_network(pd_results)
plot_network(G)
```

```
rhs lhs supp conf supp x conf \
0 [always] [case, woman] 0.068387 0.981481 0.067121
1 [case] [always, woman] 0.068387 0.913793 0.062492
2 [woman] [always, case] 0.068387 0.456897 0.031246
3 [always, case] [woman] 0.068387 1.000000 0.068387
4 [always, woman] [case] 0.068387 1.000000 0.068387
.. ... ... ... ... ...
289 [case] [trust, boost] 0.068387 0.913793 0.062492
290 [trust] [boost, case] 0.068387 1.000000 0.068387
291 [boost, case] [trust] 0.068387 1.000000 0.068387
292 [trust, boost] [case] 0.068387 1.000000 0.068387
293 [trust, case] [boost] 0.068387 1.000000 0.068387
lift
0 14.351852
1 13.362069
2 6.681034
3 6.681034
4 13.362069
.. ...
289 13.362069
290 14.622642
291 14.622642
292 13.362069
293 14.351852

[294 rows x 6 columns]
```
```python
pd_results = reformat_results(results[151:200])
G = convert_to_network(pd_results)
plot_network(G)
```

```
rhs lhs supp conf supp x conf \
0 [boost] [come, ingenious] 0.068387 0.981481 0.067121
1 [come] [boost, ingenious] 0.068387 0.679487 0.046468
2 [ingenious] [come, boost] 0.068387 1.000000 0.068387
3 [come, boost] [ingenious] 0.068387 1.000000 0.068387
4 [boost, ingenious] [come] 0.068387 1.000000 0.068387
.. ... ... ... ... ...
241 [come] [trust, case] 0.068387 0.679487 0.046468
242 [trust] [come, case] 0.068387 1.000000 0.068387
243 [come, case] [trust] 0.068387 0.981481 0.067121
244 [trust, case] [come] 0.068387 1.000000 0.068387
245 [trust, come] [case] 0.068387 1.000000 0.068387
lift
0 14.351852
1 9.935897
2 14.622642
3 14.622642
4 9.935897
.. ...
241 9.935897
242 14.351852
243 14.351852
244 9.935897
245 13.362069

[246 rows x 6 columns]
```
```python
pd_results = reformat_results(results[1501:1550])
G = convert_to_network(pd_results)
plot_network(G)
```

```
rhs lhs supp conf \
0 [said] [lowest, seoul, total] 0.096774 0.833333
1 [seoul] [lowest, said, total] 0.096774 0.757576
2 [total] [lowest, said, seoul] 0.096774 0.903614
3 [lowest, said] [seoul, total] 0.096774 1.000000
4 [lowest, seoul] [said, total] 0.096774 0.986842
.. ... ... ... ...
509 [mayor, world] [nanny, south] 0.099355 0.962500
510 [nanny, south] [mayor, world] 0.099355 0.974684
511 [world, nanny] [mayor, south] 0.099355 1.000000
512 [mayor, world, south] [nanny] 0.099355 0.962500
513 [world, nanny, south] [mayor] 0.099355 1.000000
supp x conf lift
0 0.080645 8.497807
1 0.073314 7.828283
2 0.087447 9.337349
3 0.096774 10.197368
4 0.095501 10.197368
.. ... ...
509 0.095629 9.442247
510 0.096840 9.442247
511 0.099355 7.524272
512 0.095629 9.096799
513 0.099355 7.524272

[514 rows x 6 columns]
```
```python
pd_results = reformat_results(results[1001:1050])
G = convert_to_network(pd_results)
plot_network(G)
```

```
rhs lhs supp conf \
0 [encourage] [nanny, total, south] 0.096774 0.961538
1 [nanny] [encourage, total, south] 0.096774 0.914634
2 [total] [encourage, nanny, south] 0.096774 0.903614
3 [encourage, nanny] [total, south] 0.096774 1.000000
4 [encourage, south] [nanny, total] 0.096774 1.000000
.. ... ... ... ...
549 [year, seoul] [encourage, south] 0.096774 1.000000
550 [year, south] [encourage, seoul] 0.096774 0.914634
551 [encourage, seoul, south] [year] 0.096774 1.000000
552 [encourage, year, south] [seoul] 0.096774 1.000000
553 [year, seoul, south] [encourage] 0.096774 1.000000
supp x conf lift
0 0.093052 9.805162
1 0.088513 9.451220
2 0.087447 9.337349
3 0.096774 9.810127
4 0.096774 10.197368
.. ... ...
549 0.096774 10.333333
550 0.088513 9.451220
551 0.096774 4.813665
552 0.096774 7.828283
553 0.096774 9.935897

[554 rows x 6 columns]
```
```python
pd_results = reformat_results(results[3831:3850])
G = convert_to_network(pd_results)
plot_network(G)
```

```
rhs lhs supp \
0 [mayor] [lowest, seoul, said, total] 0.096774
1 [said] [mayor, lowest, total, seoul] 0.096774
2 [seoul] [mayor, lowest, said, total] 0.096774
3 [total] [mayor, lowest, said, seoul] 0.096774
4 [lowest, mayor] [seoul, said, total] 0.096774
.. ... ... ...
459 [mayor, world, total] [lowest, seoul] 0.098065
460 [world, seoul, total] [mayor, lowest] 0.098065
461 [lowest, world, mayor, seoul] [total] 0.098065
462 [lowest, world, mayor, total] [seoul] 0.098065
463 [lowest, world, seoul, total] [mayor] 0.098065
conf supp x conf lift
0 0.728155 0.070467 7.524272
1 0.833333 0.080645 8.497807
2 0.757576 0.073314 7.828283
3 0.903614 0.087447 9.337349
4 0.949367 0.091874 9.810127
.. ... ... ...
459 0.962025 0.094341 9.810127
460 1.000000 0.098065 9.810127
461 1.000000 0.098065 9.337349
462 0.962025 0.094341 7.531006
463 1.000000 0.098065 7.524272

[464 rows x 6 columns]
```
It is worth noting that the network plots above display only a subset of the word relations in the Twitter response dataset. Because the result set from fitting the Apriori model on this data is so large, it is time-consuming and practically impossible to render all the relations and connections existing in the lemmatized tweet data in a single plot. Therefore, I divided the results into several subsets and randomly selected some of them to plot.
From the plots we can also see that although there are numerous words and connections in the lemmatized tweet dataset, the keywords (nodes) and connections (links) are highly repetitive. In other words, the insights that can be derived from this network analysis are limited, because many tweets in the dataset are retweets, which means the content is highly repetitive.
However, valuable observations can still be extracted from the information we have.
Based on the network plots above, we can see strong connections between the keywords "baby", "woman", "boost", "policy", and "seoul". From these words, we can posit a possible scenario in which the city of Seoul (South Korea) might try to boost the number of births by improving its policies on women's rights. Moreover, words like "woman", "plan", "policy", and "encourage" seem to be consistently connected, which suggests the possibility that many East Asian countries/regions might plan to modify or introduce policies to encourage women to give birth in order to boost the fertility rate of the area.